Final Project: Quarterly Conference Call Transcripts

David Fuentes (dmf4ns)

DS 5001

University of Virginia

Summer 2021


Section 1: Transcript Collection

Purpose: This section contains the code used to pull the raw transcripts from Motley Fool's conference-call transcript service.

Note that this section will scrape Motley Fool's site if you run it. Depending on your Chrome settings and whether you have ChromeDriver installed, it may not run. Regardless, the code needs to be altered to reference ChromeDriver's location on your computer.

Data output from this section flows into Section 2 and will be included with submitted files.

Set up list of companies whose data the program will scrape:

Note: Motley Fool times out after scraping about 7 companies. Waiting some time before scraping again works well. Each company has about 2 years' worth of transcripts on the MF site.

I used S&P 500 companies for this analysis. This markdown cell contains the remaining companies along with those I had already scraped (so I could copy and paste into the companies list each time I ran the scraper).

Remaining Companies: 'AON','ROP', 'JCI', 'BIIB', 'NEM', 'FCX', 'KMB', 'PSA', 'IQV', 'MSCI', 'KLAC', 'TT', 'A', 'TROW', 'LHX', 'EBAY', 'EXC', 'DLR', 'TEL', 'CMG', 'DOW', 'ALGN', 'GD', 'AEP', 'INFO', 'MET', 'DXCM', 'SNPS', 'ORLY', 'CNC', 'ROST', 'EOG', 'SRE', 'EA', 'BAX', 'APH', 'APTV', 'ALXN', 'CARR', 'PPG', 'DD', 'AIG', 'ALL', 'SPG', 'CDNS', 'BK', 'TRV', 'PRU', 'STZ', 'PH', 'SLB', 'GIS', 'RMD', 'XEL', 'MSI', 'SBAC', 'MCHP', 'SYY', 'DFS', 'CTSH', 'WELL', 'PAYX', 'IFF', 'MAR', 'FTNT', 'CMI', 'AZO', 'OTIS', 'MNST', 'YUM', 'ROK', 'MTD', 'KMI', 'CTAS', 'TDG', 'HPQ', 'MPC', 'AFL', 'FRC', 'WBA', 'HLT', 'SWK', 'ADM', 'XLNX', 'PXD', 'KR', 'ZBH', 'AVB', 'SWKS', 'PSX', 'FAST', 'AME', 'CTVA', 'PEG', 'AWK', 'EFX', 'GLW', 'PCAR', 'WMB', 'VRSK', 'MCK', 'WEC', 'ANSS', 'ES'

Finished: 'AAPL', 'MSFT', 'AMZN', 'FB', 'GOOGL', 'GOOG', 'BRK.B', 'TSLA', 'NVDA', 'JPM', 'JNJ', 'V', 'UNH', 'PYPL', 'HD', 'PG', 'DIS', 'MA', 'BAC', 'ADBE', 'CMCSA', 'XOM', 'NFLX', 'VZ', 'INTC', 'CRM', 'CSCO', 'PFE', 'KO', 'ABT', 'ABBV', 'PEP', 'NKE', 'T', 'TMO', 'CVX', 'ACN', 'AVGO', 'MRK', 'WMT', 'LLY', 'WFC', 'COST', 'TXN', 'DHR', 'MCD', 'MDT', 'QCOM', 'EPZM', 'ORCL', 'HON', 'UPS', 'PM', 'LIN', 'BMY', 'NEE', 'UNP', 'C', 'AMGN', 'SBUX', 'LOW', 'INTU', 'MS', 'BA', 'RTX', 'BLK', 'GS', 'AMT', 'IBM', 'AMAT', 'TGT', 'CAT', 'MMM', 'GE', 'ISRG', 'AXP', 'NOW', 'DE', 'AMD', 'CVS', 'SCHW', 'CHTR', 'SPGI', 'ANTM', 'ZTS', 'PLD', 'LMT', 'BKNG', 'FIS', 'LRCX', 'MU', 'MDLZ', 'MO', 'CCI', 'ADP', 'SYK', 'TMUS', 'GILD', 'TJX', 'PNC', 'COP', 'CI', 'USB', 'DUK', 'CME', 'GM', 'TFC', 'COF', 'EL', 'EQIX', 'CB', 'FDX', 'CSX', 'BDX', 'MMC', 'ATVI', 'ILMN', 'CL', 'ICE', 'EW', 'SHW', 'NSC', 'ITW', 'SO', 'FISV', 'APD', 'ADSK', 'ADI', 'ETN', 'BSX', 'D', 'REGN', 'MCO', 'EMR', 'HUM', 'PGR', 'HCA', 'F', 'NXPI', 'IDXX', 'GPN', 'TWTR', 'WM', 'NOC', 'FCX', 'ECL', 'AON', 'BIIB', 'DG', 'VRTX', 'ROP', 'JCI', 'NEM', 'KMB', 'PSA', 'IQV', 'MSCI', 'KLAC', 'TT', 'A', 'TROW', 'LHX', 'EBAY', 'EXC', 'DLR', 'TEL', 'CMG', 'DOW', 'ALGN', 'GD', 'AEP', 'INFO', 'MET', 'DXCM', 'SNPS', 'ORLY', 'CNC', 'ROST', 'EOG', 'SRE', 'EA', 'BAX', 'ALL', 'SPG', 'CDNS', 'BK', 'TRV', 'PRU', 'STZ', 'APH', 'APTV', 'ALXN', 'CARR', 'PPG', 'DD', 'AIG'


Scraper 1: Get Transcript URLs For Each Company

Open the Motley Fool URL for each ticker in the companies list; this opens the company's dashboard page. The transcript links on each company's page contain the sub-string call-transcript, so I use BeautifulSoup to save down any link on the page containing that sub-string (via a list comprehension). I append all the transcript URLs for each company to the same list.
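The link-harvesting step can be sketched like this (a minimal illustration on a made-up HTML fragment; the real pages are loaded via Selenium/ChromeDriver first):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a company dashboard page (illustrative only)
html = """
<a href="/earnings/call-transcripts/2021/04/29/amazon-amzn-q1-2021-earnings-call-transcript/">Q1 2021 call</a>
<a href="/quote/nasdaq/amzn/">Quote page</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep any link whose href contains the transcript sub-string
transcript_urls = [a["href"] for a in soup.find_all("a", href=True)
                   if "call-transcript" in a["href"]]
print(transcript_urls)
```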


Scraper 2: Collect Transcript Data From URLs

The first scraper created a list of transcript links for the companies in the companies list. We can now open each transcript and pull the raw transcript data.


This process was run about 30 times to pull sufficient transcript data. I am sure I can figure out a more practical way to get around the time-out errors from Motley Fool (or I could have automated a batch run every few hours for a new set of companies), but I simply ran this file whenever I remembered to over a week or so, and it wasn't a problem.
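The automated-batch idea mentioned above could be sketched like this (a hypothetical batch size of 7, matching the observed time-out behavior; the scrape call and sleep are placeholders):

```python
import time

def chunked(seq, size):
    """Yield successive batches of `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

companies = ['AAPL', 'MSFT', 'AMZN', 'FB', 'GOOGL', 'GOOG', 'BRK.B', 'TSLA']
batches = list(chunked(companies, 7))  # MF times out after ~7 companies

for batch in batches:
    # scrape(batch) would go here (hypothetical scraper function)
    # time.sleep(3 * 60 * 60)  # wait a few hours before the next batch
    pass

print(batches)
```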

Section 2: Raw Transcript Transform


Purpose: This section contains the code used to transform the raw transcript data pulled from Motley Fool.

YOU NEED TO CHANGE THE PATH BELOW. The rest of the input/output should work properly.

Get file names with raw transcript data.


Create a data frame containing all of the raw transcript data:

View the raw data and save as F0:


Begin transcript cleaning:

Boilerplate Removal

The transcripts contain many of the same lines for things like advertisements and other boilerplate items. I want to get rid of these.

As an example, check some of the lines that contain Motley Fool, a phrase that should not appear within any actual transcript text.

Create a list of these common transcript lines we want to remove.

Remove lines that likely aren't transcript data:
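The removal step can be sketched as an isin filter (the boilerplate lines shown are illustrative examples, not the actual list, which was built by inspection):

```python
import pandas as pd

# Illustrative boilerplate lines (not the real list)
boilerplate = [
    "The Motley Fool has a disclosure policy.",
    "Motley Fool Transcribers has no position in any of the stocks mentioned.",
]

lines = pd.Series([
    "Good afternoon and welcome to the call.",
    "The Motley Fool has a disclosure policy.",
])

# Keep only lines that are not exact matches of known boilerplate
clean = lines[~lines.isin(boilerplate)].reset_index(drop=True)
print(clean.tolist())
```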

We now have a decently clean set of transcripts without the boilerplate/common lines that we aren't interested in. I will continue to clean these data, but this was a good first pass.


HTML Cleanup

Identify lines containing a speaker name.

Clean up HTML tags and produce clean transcript data:
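One way to strip the tags (a regex-based sketch; BeautifulSoup's get_text() is an alternative):

```python
import re

def strip_tags(line: str) -> str:
    """Remove HTML tags and collapse leftover whitespace."""
    text = re.sub(r'<[^>]+>', ' ', line)
    return re.sub(r'\s+', ' ', text).strip()

print(strip_tags('<p><strong>Operator</strong></p>'))  # -> Operator
```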


Create Columns for Data

I now have some cleansed transcript data. I want to start adding some helpful columns to my table:

Add an identifier in the table labeling where the transcript data changes from one company to another.

Add an iterator to identify each line in each of the transcripts. The lines mostly represent paragraph changes. I later add a column identifying speaker handoffs as well.

Add company name to the DF. I later map to my S&P table on ticker for consistency (not all transcripts used the same naming convention for a company).

I now have the company name. Get ticker from URL as well.

Tickers are in with some of the odd cases fixed. Now add the date of the calls:

A better way to get the quarter from the URL:
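Assuming Motley Fool's URL pattern (…-&lt;ticker&gt;-q&lt;N&gt;-&lt;year&gt;-earnings-call-transcript), both the ticker and the quarter can be pulled with a single regex; the URL shown is an illustrative example:

```python
import re

url = ('https://www.fool.com/earnings/call-transcripts/2021/04/29/'
       'amazon-amzn-q1-2021-earnings-call-transcript/')

# Assumed URL pattern: ...-<ticker>-q<N>-<year>-earnings-call-transcript
m = re.search(r'-([a-z.]+)-q(\d)-(\d{4})-earnings', url)
ticker, quarter, year = m.group(1).upper(), int(m.group(2)), int(m.group(3))
print(ticker, quarter, year)
```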

Extra ad cleaning:

I had a speaker map in which I looked through the auto-generated speaker list to make sure nothing looked too off. I cleaned it in Excel on the side, so now I am updating the table. If a transcript line consists entirely of a speaker name, then I know the transcript changed speakers, so I change the name in the speaker_full column.

Clean up speaker full so there is just a speaker column as well:


Add S&P info from the mapping table (sector, HQ, etc.)

The raw data are transformed and cleansed. They were sent to a CSV. There are 196 companies in the data with about 2000 conference calls.

Quick Check on the Data

Section 3: Get Yahoo Finance Data


Purpose: This section connects to Yahoo Finance and pulls the share-price information and market caps for each of the companies in the clean dataset. I also add a Q&A section.

The code below is a test snippet on Amazon, which is later built out into the full loop on the entire dataset.


Read in the clean transcript data:

Get a DF containing just the tickers and dates within the file:

Get min and max call dates for each company:
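The min/max aggregation can be sketched with a groupby on toy data:

```python
import pandas as pd

# Toy call-date data (illustrative tickers and dates)
calls = pd.DataFrame({
    'ticker': ['AMZN', 'AMZN', 'AAPL'],
    'call_date': pd.to_datetime(['2020-07-30', '2021-04-29', '2021-01-27']),
})

# Earliest and latest call date per company
date_range = calls.groupby('ticker')['call_date'].agg(['min', 'max'])
print(date_range)
```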

Test on Amazon's min date. The idea is to get the minimum date's stock price through to the day the data were pulled in a DF. Later on, this DF can be used as a reference table to provide price information needed for modeling.

Loop through the grouped DF to find the prices for each ticker. Yahoo Finance also provides market caps, but I couldn't look them up by date, so they are all as of when this file was last run (August 12 in the case of my data). This isn't ideal, but it's okay for my purposes and is something I will look into further. I add these market caps to a dictionary to be used later.

Collect Yahoo Finance Data

Show dictionary containing market caps for each company:

Show price table:

Merge the two (i.e. map the dictionary to the price table)


Merge the price and market-cap table to the cleansed data

Now the closing prices per company as of each call date are in the data.

Create columns for response variables (closing price as of call plus N days)

We now want to create a set of columns representing weekly intervals out from the conference-call date through 3 months out (i.e., roughly the next earnings call).

Now we can look up the prices for each of these dates from the pricing table.

The following loop creates a column called close_n which will contain the close price n days from the call date (this date is found in the columns created above). It uses the DF created in the Yahoo Finance step as a lookup table.
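The close_n construction can be sketched as a series of merges against the price lookup table (toy, made-up prices; the real table comes from the Yahoo Finance pull):

```python
import pandas as pd

# Hypothetical price reference table (stand-in for the Yahoo Finance data)
prices = pd.DataFrame({
    'ticker': ['AMZN'] * 3,
    'date': pd.to_datetime(['2021-04-29', '2021-05-06', '2021-05-13']),
    'close': [100.0, 105.0, 110.0],  # made-up illustrative prices
})

calls = pd.DataFrame({
    'ticker': ['AMZN'],
    'call_date': pd.to_datetime(['2021-04-29']),
})

for n in (7, 14):  # days after the call
    calls[f'date_{n}'] = calls['call_date'] + pd.Timedelta(days=n)
    # Look up the close price on the shifted date
    calls = calls.merge(
        prices.rename(columns={'date': f'date_{n}', 'close': f'close_{n}'}),
        on=['ticker', f'date_{n}'], how='left')

print(calls[['close_7', 'close_14']])
```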

Additional Data Checks


Create a question and answer flag

There is a pattern in the speaker_full column that identifies an analyst from a bank, research firm, etc., versus a company employee. If a match occurs too early in the call, it shouldn't be treated as an analyst; I don't want to start the Q&A flag too soon.
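A sketch of the flag logic, assuming the "-- Analyst" convention in speaker_full and a hypothetical line-number cutoff to avoid starting Q&A too early:

```python
import pandas as pd

df = pd.DataFrame({
    'line_num': [1, 2, 40, 41],
    'speaker_full': [
        'Operator',
        'Dave Fildes -- Director, Investor Relations',
        'Brian Nowak -- Morgan Stanley -- Analyst',
        'Brian Olsavsky -- Chief Financial Officer',
    ],
})

# Assumed convention: analysts are tagged "Analyst" in speaker_full.
# Hypothetical cutoff: ignore matches in the first 10 lines of the call.
is_analyst = df['speaker_full'].str.contains('Analyst')
df['qa'] = (is_analyst & (df['line_num'] > 10)).cummax().astype(int)
print(df['qa'].tolist())
```

Once the first qualifying analyst appears, cummax keeps the flag set for the rest of the call.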

Check some of the data from both splits of Q&A to make sure they pass the eye test.

Add market cap to the data frame:

Write new file to a CSV:

Make a smaller DF with price information:

Section 4: Create OHCO and Analysis Files


Purpose: This file creates the OHCO files and digital corpus (F1, F2, etc.)


Data are in a DF. Some extra cleaning follows (likely no longer necessary; the data should already be clean, and this is an artifact of something I worked on a few weeks ago).

df.quarter.drop_duplicates()

Create LIB table (contains data URL for each call, the file containing the raw data, etc.)

Start making OHCO files:


Group data by Call to produce transcripts by quarter, ticker, and Q&A split.
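The call-level grouping can be sketched like this on toy data (one document string per ticker, quarter, and Q&A split):

```python
import pandas as pd

df = pd.DataFrame({
    'ticker': ['AMZN', 'AMZN', 'AMZN'],
    'quarter': ['Q1 2021'] * 3,
    'qa': [0, 0, 1],
    'line': ['Welcome to the call.', 'Revenue grew.', 'Question on margins?'],
})

# One document per (ticker, quarter, Q&A split)
calls = (df.groupby(['ticker', 'quarter', 'qa'])['line']
           .apply(' '.join)
           .to_frame('doc_str'))
print(calls)
```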

Merge price info to call data:


Group data by Speaker and Call to produce transcripts by speaker, quarter, ticker, and Q&A split. This should capture each handoff between speakers in each call.


Group data by company: just a big string of calls per ticker (probably not too important).


Group data by Speaker and add in Q&A:

Tokenize calls:

Create tokens from each call:

Tokenize even lines (tokenizing all at once gave me memory issues, so I split to odds and evens):

Tokenize odd lines:

Create terms:

Create VOCAB table:

Get just the POS

Get the max (most frequent) POS per term, since the POS tag on any individual token might be wrong. Taking the modal tag should smooth out tagging errors.
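One way to compute the max POS, sketched with pandas (interpreting "max" as the most frequent tag per term, per the description above):

```python
import pandas as pd

# Toy token table with per-token POS tags
tokens = pd.DataFrame({
    'term': ['growth', 'growth', 'growth', 'call'],
    'pos': ['NN', 'NN', 'VB', 'NN'],
})

# Most frequent POS tag per term, to smooth out tagger errors
pos_max = (tokens.groupby(['term', 'pos']).size()
                 .rename('n').reset_index()
                 .sort_values('n', ascending=False)
                 .drop_duplicates('term')
                 .set_index('term')['pos'])
print(pos_max.to_dict())
```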

Remove stop words:

Stem the terms and add to the VOCAB and TOKEN tables:

Send TOKEN and VOCAB tables to CSVs.

Section 5: Text Analysis


Purpose: This section runs various analyses, such as TF-IDF, PCA, topic modeling, emotion/sentiment analysis, etc.

Setup (including TF-IDF function to be used later). Bring in TOKEN, VOCAB, etc.

Read in TOKEN and VOCAB tables:


TF-IDF Analysis

Run TFIDF on the reduced TOKEN table and show important words:
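The core TF-IDF computation can be sketched on a toy DTCM (log2 IDF is one common weighting choice; the actual function may differ in details):

```python
import numpy as np
import pandas as pd

# Toy document-term count matrix (DTCM): rows are calls, columns are terms
dtcm = pd.DataFrame(
    [[3, 0, 1],
     [0, 2, 1]],
    index=['AMZN_Q1', 'AAPL_Q1'],
    columns=['cloud', 'iphone', 'revenue'])

tf = dtcm.div(dtcm.sum(axis=1), axis=0)   # term frequency per document
df_ = (dtcm > 0).sum()                    # document frequency per term
idf = np.log2(len(dtcm) / df_)            # inverse document frequency
tfidf = tf * idf
print(tfidf.round(3))
```

Terms that appear in every document (like 'revenue' here) get an IDF of zero and drop out, which is the point of the weighting.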

DTCM:


PCA Analysis

Get top terms from the TFIDF output and start eigen value and vector creation for PCA:
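The eigen decomposition step can be sketched as PCA via the covariance matrix (random toy data standing in for the TF-IDF matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))            # toy stand-in for the TF-IDF matrix
Xc = X - X.mean(axis=0)                 # center each feature (term)

cov = np.cov(Xc, rowvar=False)          # term covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigen decomposition (ascending order)
order = eigvals.argsort()[::-1]         # sort PCs by explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs                      # columns are principal components
scores = Xc @ eigvecs                   # project documents onto the PCs
print(eigvals.round(3))
```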

Show loadings:

Words associated with the top PCs are very clear: we see medical terms, financial terms, tech, and infrastructure/energy. See the plots below and the report for more.

DCM

PCA and Loadings Graphs

View loadings:


LDA Topic Modeling

The transition CSV contains a DF with transitions between speakers in each conference call. This is sort of like a paragraph change in a book, so I will be thinking about it this way.

Adding the index including Q&A flag:

Create Phi and Theta tables.
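A sketch of the Phi/Theta construction, using scikit-learn's LatentDirichletAllocation as one possible implementation (toy count matrix; Theta holds document-topic mixtures, Phi holds topic-term distributions):

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Toy count matrix: rows are speaker transitions ("paragraphs"), cols are terms
counts = np.array([
    [2, 1, 0, 0],
    [3, 0, 1, 0],
    [0, 0, 2, 3],
    [0, 1, 3, 2],
])

lda = LatentDirichletAllocation(n_components=2, random_state=42)
theta = lda.fit_transform(counts)  # document-topic mixtures (Theta)
phi = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # Phi
print(theta.shape, phi.shape)
```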

Get topics:

See the difference in top terms between presentation and Q&A

Visualize company presentation vs. Q&A

See report for more details on these graphs


Sentiment and Emotion Analysis

Interesting that the polarity is higher in Q&A -- maybe because people are generally congratulatory during Q&A? Complexity is also higher.


t-SNE

There are a lot of data, so I break out by ticker for now. Add tickers to the val list to run more; this can also be adjusted to run by sector.


VADER

Prelim analysis:

Apply polarity scores to each transition in the 'paragraph' data frame.

Apply the same for the data frame with Q&A flag.

Apple sentiment chart for Q1 2021 (just as an example):


Modeling Setup:

Sentiment analysis is run on each speaker-to-speaker transition. I figure smaller chunks yield more reliable sentiment scores; the data are then aggregated up to the call level.

Create some functions to help with analysis:

Run VADER on each transition prior to aggregating.

Add features for modeling based on POS tagging (divided by word count):

I want to drop the excess date columns as they are not necessary. I do this with list comprehension.

Create dictionary for aggregating:

Get relative ratio of the different POS counts by dividing by word count:

Add word count column to be used as a feature variable in the models:

Get Q&A word count into the DF

Section 6: Modeling


Purpose: Use feature variables created in previous files based on text transcripts to predict a stock's price movement.

Pre-processing

I use the file that is not split by Q&A for the modeling. I keep word count by Q&A as a variable, which I think is an important feature from the Q&A work, but I am not sure the split file itself is too helpful: each price is duplicated across the split in that file, which may hurt the predictive power of the resulting model.

I one-hot encode the categorical data (i.e. the sectors):
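The encoding step can be sketched with pandas (toy data; get_dummies expands the sector column into one indicator column per level):

```python
import pandas as pd

df = pd.DataFrame({'ticker': ['AMZN', 'JPM'],
                   'sector': ['Consumer Discretionary', 'Financials']})

# One column per sector level, original 'sector' column dropped
ohe = pd.get_dummies(df, columns=['sector'])
print(ohe.columns.tolist())
```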

See quantitative variables:

The features are at many different scales, so I want to normalize to avoid potential issues in the modeling:
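One simple normalization choice is min-max scaling to [0, 1], sketched below on toy features (the actual scaler used may differ):

```python
import pandas as pd

# Toy feature frame at very different scales
X = pd.DataFrame({'word_count': [1000, 3000, 2000],
                  'polarity': [0.1, 0.5, 0.3]})

# Min-max scale each feature column to [0, 1]
X_norm = (X - X.min()) / (X.max() - X.min())
print(X_norm)
```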

Add percentage change, which I also want to model. I do this with a loop that calculates the relative change for each 7-day period from the starting date. The percentage change is now a predictor in addition to the closing price +N weeks.
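The percentage-change loop can be sketched like this (toy close values; column names close_0/close_n are illustrative):

```python
import pandas as pd

# Toy close prices: at the call date and 7/14 days later
df = pd.DataFrame({'close_0': [100.0], 'close_7': [105.0], 'close_14': [110.0]})

# Relative change for each 7-day horizon vs. the call-date close
for n in (7, 14):
    df[f'per_{n}'] = df[f'close_{n}'] / df['close_0'] - 1

print(df[['per_7', 'per_14']])
```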

Bring back the OHE and quantitative vars (i.e. POS, word counts, sentiment):

Index the DF to get the model DF (with all variables, both features and predictors):

Add the company info to the model-variables DF:


Modeling:

Create a function that returns two data frames: the test DF with predictions and the training DF with predictions. Both DFs contain other statistics and metrics to help determine model performance. The input metric can be 'close' for the closing price at day N (multiples of 7 days) or 'per' for the percentage change; these metrics represent the predictor variables. A model can also be passed into the function.

Setting test_train = True returns one DF with a test/train flag.
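A hypothetical sketch of such a helper (function name, signature, and toy data are my own; LinearRegression stands in as the default model):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def fit_and_score(df, features, target, model=None, test_train=False):
    """Fit a model and return test/train DFs with predictions attached."""
    model = model or LinearRegression()
    X_tr, X_te, y_tr, y_te = train_test_split(
        df[features], df[target], test_size=0.25, random_state=0)
    model.fit(X_tr, y_tr)
    train = X_tr.assign(actual=y_tr, pred=model.predict(X_tr), split='train')
    test = X_te.assign(actual=y_te, pred=model.predict(X_te), split='test')
    if test_train:
        return pd.concat([train, test])  # single DF with a test/train flag
    return test, train

# Toy data: close_7 depends mostly on x1
rng = np.random.default_rng(1)
df = pd.DataFrame({'x1': rng.normal(size=40), 'x2': rng.normal(size=40)})
df['close_7'] = 2 * df['x1'] + rng.normal(scale=0.1, size=40)

test, train = fit_and_score(df, ['x1', 'x2'], 'close_7')
print(len(test), len(train))
```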

Show MSE for model on test and training data:

Produce Some CV models:

Test function:

As an example, return the two DFs with predictions. We can see the error for both test and train, which helps determine whether there is overfitting.

The R^2 values for test and train are similar, which is good.

Plot actual percentage gain against market cap. I show the test-train split to check that the model doesn't appear overfit. This is for call date +90 days.

Show the rel_err from the model (predicted gain less actual) by market cap. Notice more volatility at lower market caps. Call date +90 days.

A final initial visualization: rel_err vs. market cap by sector. Same call date +90 days.


Model Performance Analysis

I want to see, out of all the models, which performs the best. I run all models 90 times (once per number of days out) to predict closing price. I then group the output and analyze.

Lookup R2 values for test and train (these are output from the model functions, which I fed into dictionaries).

Full metrics table: one row per model per day post-call (1-90 days).

Top 2 models by sector and test-train split, sorted by the MSrelE metric:

Notice that the LassoCV model is largely the best or second-best performer. Use it as the champion model.

Show some quick distributions to see if there are data oddities. Notice slight imbalance by sector and market caps.


Visualize Results

I made some relatively simple graph functions to take in my data and visualize it. They are pretty flexible, especially in combination with the model_ function.

Loop through list of days out and graph.

The plots below show the predicted close price at day n (printed above the graph) against the actual day n close price. These are for the Lasso CV model, and I show prediction increments of 1, 2, 3, 7, 14, and 90 days. Notice the predictions begin to disperse as time increases; there is more spread around the parity line.